{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basic CNN part-of-speech tagger with Thinc\n",
    "\n",
    "This notebook shows how to implement a basic CNN for part-of-speech tagging model in Thinc (without external dependencies) and train the model on the Universal Dependencies [AnCora corpus](https://github.com/UniversalDependencies/UD_Spanish-AnCora). The tutorial shows three different workflows:\n",
    "\n",
    "1. Composing the model **in code** (basic usage)\n",
    "2. Composing the model **via a config file only** (mostly to demonstrate advanced usage of configs)\n",
    "3. Composing the model **in code and configuring it via config** (recommended)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"thinc>=8.0.0a0\" \"ml_datasets>=0.2.0a0\" \"tqdm>=4.41\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We start by making sure the computation is performed on GPU if available. `prefer_gpu` should be called right after importing Thinc, and it returns a boolean indicating whether the GPU has been activated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinc.api import prefer_gpu\n",
    "\n",
    "prefer_gpu()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also define the following helper functions for loading the data, and training and evaluating a given model. Don't forget to call `model.initialize` with a batch of input and output data to initialize the model and fill in any missing shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import ml_datasets\n",
    "from tqdm.notebook import tqdm\n",
    "from thinc.api import fix_random_seed\n",
    "\n",
    "fix_random_seed(0)\n",
    "\n",
    "def train_model(model, optimizer, n_iter, batch_size):\n",
    "    (train_X, train_y), (dev_X, dev_y) = ml_datasets.ud_ancora_pos_tags()\n",
    "    model.initialize(X=train_X[:5], Y=train_y[:5])\n",
    "    for n in range(n_iter):\n",
    "        loss = 0.0\n",
    "        batches = model.ops.multibatch(batch_size, train_X, train_y, shuffle=True)\n",
    "        for X, Y in tqdm(batches, leave=False):\n",
    "            Yh, backprop = model.begin_update(X)\n",
    "            d_loss = []\n",
    "            for i in range(len(Yh)):\n",
    "                d_loss.append(Yh[i] - Y[i])\n",
    "                loss += ((Yh[i] - Y[i]) ** 2).sum()\n",
    "            backprop(d_loss)\n",
    "            model.finish_update(optimizer)\n",
    "        score = evaluate(model, dev_X, dev_y, batch_size)\n",
    "        print(f\"{n}\\t{loss:.2f}\\t{score:.3f}\")\n",
    "        \n",
    "def evaluate(model, dev_X, dev_Y, batch_size):\n",
    "    correct = 0\n",
    "    total = 0\n",
    "    for X, Y in model.ops.multibatch(batch_size, dev_X, dev_Y):\n",
    "        Yh = model.predict(X)\n",
    "        for yh, y in zip(Yh, Y):\n",
    "            correct += (y.argmax(axis=1) == yh.argmax(axis=1)).sum()\n",
    "            total += y.shape[0]\n",
    "    return float(correct / total)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 1. Composing the model in code\n",
    "\n",
    "Here's the model definition, using the `>>` operator for the `chain` combinator. The `strings2arrays` transform converts a sequence of strings to a list of arrays. `with_array` transforms sequences (the sequences of arrays) into a contiguous 2-dimensional array on the way into and out of the model it wraps. This means our model has the following signature: `Model[Sequence[str], Sequence[Array2d]]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinc.api import Model, chain, strings2arrays, with_array, HashEmbed, expand_window, Relu, Softmax, Adam, warmup_linear\n",
    "\n",
    "width = 32\n",
    "vector_width = 16\n",
    "nr_classes = 17\n",
    "learn_rate = 0.001\n",
    "n_iter = 10\n",
    "batch_size = 128\n",
    "\n",
    "with Model.define_operators({\">>\": chain}):\n",
    "    model = strings2arrays() >> with_array(\n",
    "        HashEmbed(nO=width, nV=vector_width, column=0)\n",
    "        >> expand_window(window_size=1)\n",
    "        >> Relu(nO=width, nI=width * 3)\n",
    "        >> Relu(nO=width, nI=width)\n",
    "        >> Softmax(nO=nr_classes, nI=width)\n",
    "    )\n",
    "optimizer = Adam(learn_rate)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_model(model, optimizer, n_iter, batch_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Composing the model via a config file\n",
    "\n",
    "Thinc's config system lets describe **arbitrary trees of objects**. The config can include values like hyperparameters or training settings, or references to functions and the values of their arguments. Thinc will then construct the config **bottom-up** – so you can define one function with its arguments, and then pass the return value into another function.\n",
    "\n",
    "If we want to rebuild the model defined above in a config file, we first need to break down its structure:\n",
    "\n",
    "* `chain` (any number of positional arguments)\n",
    "  * `strings2arrays` (no arguments)\n",
    "  * `with_array` (one argument **layer**)\n",
    "    * **layer:** `chain` (any number of positional arguments)\n",
    "      * `HashEmbed`\n",
    "      * `expand_window`\n",
    "      * `Relu`\n",
    "      * `Relu`\n",
    "      * `Softmax`\n",
    "\n",
    "`chain` takes a variable number of positional arguments (the layers to compose). In the config, positional arguments can be expressed using `*` in the dot notation. For example, `model.layer` could describe a function passed to `model` as the argument `layer`, while `model.*.relu` defines a positional argument passed to `model`. The name of the argument, e.g. `relu` – doesn't matter in this case. It just needs to be unique.\n",
    "\n",
    "> ⚠️ **Important note:** This example is mostly intended to show what's possible. We don't recommend \"programming via config files\" as shown here, since it doesn't really solve any problem and makes the model definition just as complicated. Instead, we recommend a hybrid approach: wrap the model definition in a registed function and configure it via the config."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CONFIG = \"\"\"\n",
    "[hyper_params]\n",
    "width = 32\n",
    "vector_width = 16\n",
    "learn_rate = 0.001\n",
    "\n",
    "[training]\n",
    "n_iter = 10\n",
    "batch_size = 128\n",
    "\n",
    "[model]\n",
    "@layers = \"chain.v1\"\n",
    "\n",
    "[model.*.strings2arrays]\n",
    "@layers = \"strings2arrays.v1\"\n",
    "\n",
    "[model.*.with_array]\n",
    "@layers = \"with_array.v1\"\n",
    "\n",
    "[model.*.with_array.layer]\n",
    "@layers = \"chain.v1\"\n",
    "\n",
    "[model.*.with_array.layer.*.hashembed]\n",
    "@layers = \"HashEmbed.v1\"\n",
    "nO = ${hyper_params:width}\n",
    "nV = ${hyper_params:vector_width}\n",
    "column = 0\n",
    "\n",
    "[model.*.with_array.layer.*.expand_window]\n",
    "@layers = \"expand_window.v1\"\n",
    "window_size = 1\n",
    "\n",
    "[model.*.with_array.layer.*.relu1]\n",
    "@layers = \"Relu.v1\"\n",
    "nO = ${hyper_params:width}\n",
    "nI = 96\n",
    "\n",
    "[model.*.with_array.layer.*.relu2]\n",
    "@layers = \"Relu.v1\"\n",
    "nO = ${hyper_params:width}\n",
    "nI = ${hyper_params:width}\n",
    "\n",
    "[model.*.with_array.layer.*.softmax]\n",
    "@layers = \"Softmax.v1\"\n",
    "nO = 17\n",
    "nI = ${hyper_params:width}\n",
    "\n",
    "[optimizer]\n",
    "@optimizers = \"Adam.v1\"\n",
    "learn_rate = ${hyper_params:learn_rate}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the config is loaded, it's first parsed as a dictionary and all references to values from other sections, e.g. `${hyper_params:width}` are replaced. The result is a nested dictionary describing the objects defined in the config."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from thinc.api import registry, Config\n",
    "\n",
    "config = Config().from_str(CONFIG)\n",
    "config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`registry.resolve` then creates the objects and calls the functions **bottom-up**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "C = registry.resolve(config)\n",
    "C"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have a model, optimizer and training settings, built from the config, and can use them to train the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = C[\"model\"]\n",
    "optimizer = C[\"optimizer\"]\n",
    "n_iter = C[\"training\"][\"n_iter\"]\n",
    "batch_size = C[\"training\"][\"batch_size\"]\n",
    "train_model(model, optimizer, n_iter, batch_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## Composing the model with code and config\n",
    "\n",
    "The `@thinc.registry` decorator lets you register your own layers and model definitions, which can then be referenced in config files. This approach gives you the most flexibility, while also keeping your config and model definitions concise.\n",
    "\n",
    "> 💡 The function you register will be filled in by the config – e.g. the value of `width` defined in the config block will be passed in as the argument `width`. If arguments are missing, you'll see a validation error. If you're using **type hints** in the function, the values will be parsed to ensure they always have the right type. If they're invalid – e.g. if you're passing in a list as the value of `width` – you'll see an error. This makes it easier to prevent bugs caused by incorrect values lower down in the network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import thinc\n",
    "from thinc.api import Model, chain, strings2arrays, with_array, HashEmbed, expand_window, Relu, Softmax, Adam, warmup_linear\n",
    "\n",
    "@thinc.registry.layers(\"cnn_tagger.v1\")\n",
    "def create_cnn_tagger(width: int, vector_width: int, nr_classes: int = 17):\n",
    "    with Model.define_operators({\">>\": chain}):\n",
    "        model = strings2arrays() >> with_array(\n",
    "            HashEmbed(nO=width, nV=vector_width, column=0)\n",
    "            >> expand_window(window_size=1)\n",
    "            >> Relu(nO=width, nI=width * 3)\n",
    "            >> Relu(nO=width, nI=width)\n",
    "            >> Softmax(nO=nr_classes, nI=width)\n",
    "        )\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The config would then only need to define one model block with `@layers = \"cnn_tagger.v1\"` and the function arguments. Whether you move them out to a section like `[hyper_params]` or just hard-code them into the block is up to you. The advantage of a separate section is that the values are **preserved in the parsed config object** (and not just passed into the function), so you can always print and view them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "CONFIG = \"\"\"\n",
    "[hyper_params]\n",
    "width = 32\n",
    "vector_width = 16\n",
    "learn_rate = 0.001\n",
    "\n",
    "[training]\n",
    "n_iter = 10\n",
    "batch_size = 128\n",
    "\n",
    "[model]\n",
    "@layers = \"cnn_tagger.v1\"\n",
    "width = ${hyper_params:width}\n",
    "vector_width = ${hyper_params:vector_width}\n",
    "nr_classes = 17\n",
    "\n",
    "[optimizer]\n",
    "@optimizers = \"Adam.v1\"\n",
    "learn_rate = ${hyper_params:learn_rate}\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "C = registry.resolve(Config().from_str(CONFIG))\n",
    "C"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = C[\"model\"]\n",
    "optimizer = C[\"optimizer\"]\n",
    "n_iter = C[\"training\"][\"n_iter\"]\n",
    "batch_size = C[\"training\"][\"batch_size\"]\n",
    "train_model(model, optimizer, n_iter, batch_size)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.7.2 64-bit ('.env': venv)",
   "language": "python",
   "name": "python37"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
